AWE-CM Vectors: Augmenting Word Embeddings with a Clinical Metathesaurus

نویسندگان

  • Willie Boag
  • Hassan Kané
چکیده

In recent years, word embeddings have been surprisingly effective at capturing intuitive characteristics of the words they represent. These vectors achieve the best results when training corpora are extremely large, sometimes billions of words. Clinical natural language processing datasets, however, tend to be much smaller. Even the largest publicly-available dataset of medical notes is three orders of magnitude smaller than the dataset of the oft-used “Google News” word vectors. In order to make up for limited training data sizes, we encode expert domain knowledge into our embeddings. Building on a previous extension of word2vec, we show that generalizing the notion of a word’s “context” to include arbitrary features creates an avenue for encoding domain knowledge into word embeddings. We show that the word vectors produced by this method outperform their textonly counterparts across the board in correlation with clinical experts.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Word embeddings and recurrent neural networks based on Long-Short Term Memory nodes in supervised biomedical word sense disambiguation

Word sense disambiguation helps identifying the proper sense of ambiguous words in text. With large terminologies such as the UMLS Metathesaurus ambiguities appear and highly effective disambiguation methods are required. Supervised learning algorithm methods are used as one of the approaches to perform disambiguation. Features extracted from the context of an ambiguous word are used to identif...

متن کامل

Learning Better Embeddings for Rare Words Using Distributional Representations

There are two main types of word representations: low-dimensional embeddings and high-dimensional distributional vectors, in which each dimension corresponds to a context word. In this paper, we initialize an embedding-learning model with distributional vectors. Evaluation on word similarity shows that this initialization significantly increases the quality of embeddings for rare words.

متن کامل

Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model

We propose a Word-Topic Mixture(WTM) model to improve word representation and topic model simultaneously. Firstly, it introduces the initial external word embeddings into the Topical Word Embeddings(TWE) model based on Latent Dirichlet Allocation(LDA) model to learn word embeddings and topic vectors. Then the results learned from TWE are integrated in the LDA by defining the probability distrib...

متن کامل

Online Learning of Interpretable Word Embeddings

Word embeddings encode semantic meanings of words into low-dimension word vectors. In most word embeddings, one cannot interpret the meanings of specific dimensions of those word vectors. Nonnegative matrix factorization (NMF) has been proposed to learn interpretable word embeddings via non-negative constraints. However, NMF methods suffer from scale and memory issue because they have to mainta...

متن کامل

Best of Both Worlds: Making Word Sense Embeddings Interpretable

Word sense embeddings represent a word sense as a low-dimensional numeric vector. While this representation is potentially useful for NLP applications, its interpretability is inherently limited. We propose a simple technique that improves interpretability of sense vectors by mapping them to synsets of a lexical resource. Our experiments with AdaGram sense embeddings and BabelNet synsets show t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1712.01460  شماره 

صفحات  -

تاریخ انتشار 2017